Introduction

We are living in an era of information explosion. Social media now receives far more attention than traditional media because it makes it convenient for people in different places to share information and make connections. Twitter is one of the most popular social media platforms, with 237.8 million monetizable daily active users. Every second, users post countless Tweets expressing opinions and emotions about a wide variety of topics. Twitter is known for its hashtags, a combination of the # symbol and index keywords, which help users find and follow the topics they are interested in. Another distinctive feature is Twitter Trends, which detects the most popular, constantly changing trending topics in different countries. Twitter also lets users retrieve Tweets through the Twitter API without opening the application. Identifying trending topics on Twitter involves counting methods and various machine learning algorithms that extract targeted keywords (Rodrigues et al. 2021).

Google, the largest search engine, also provides a free tool, Google Trends, for analyzing the popularity of search terms using real-time data. Similar to the Twitter API, Google Trends lets users see what people are searching for across different time periods, seasons, and locations. Its main purpose is to give users key insights into the volume of Google searches for specific terms and to show the relative popularity of these queries by geographical location.

The Twitter API and Google Trends make it easy to extract specific information and dig deeper into the topics we care about. When we were choosing a topic for this project, we noticed that, in addition to the regular Explore tabs “News”, “Sports”, and “Entertainment”, a new “World Cup” tab had appeared at the top of the Twitter homepage because of the ongoing 2022 Qatar World Cup, the most viewed and followed sporting event in the world, which is now dominating online discussions, especially on Twitter. Elon Musk revealed that Twitter traffic related to the 2022 Qatar World Cup almost hit 20,000 tweets per second. This prompted us to analyze the sentiment around the topic “World Cup” and how people worldwide or in a specific region respond to it. When users click the “World Cup” tab, they see several featured Tweets and accounts as they scroll down the page. However, unlike the other tabs, it offers no trending hashtags or topics. We therefore tracked Tweets before and after each quarter-final match, both around the world and within the U.S. By comparing the trends in the two areas, we can see how people in different areas react to the World Cup. This project’s GitHub repository is at https://github.com/EEElisa/WorldCupAnalysis.git.

Data

In this project, the data sources are Twitter and Google Trends. As for Twitter, we gathered and cached Tweets at several time points during the quarter-finals of the World Cup (especially before and after each match) with search scope restricted to the U.S. and the whole world. As for Google Trends, we accessed the search volumes of “World Cup” with the same two search scopes, U.S. and the world.
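Google Trends reports search interest on a relative scale rather than as raw counts: values are scaled so that the peak within the selected period and region equals 100. A minimal base R sketch of that kind of scaling is shown here. The counts and the helper name `to_relative_interest` are made up for illustration and are not data or code from this project.

```r
# Scale raw counts to a 0-100 index relative to the peak value.
# This only illustrates the idea of a relative index; it is not Google's code.
to_relative_interest <- function(counts) {
  stopifnot(is.numeric(counts), length(counts) > 0, max(counts) > 0)
  round(100 * counts / max(counts))
}

toy_counts <- c(120, 300, 600, 450)  # made-up daily search counts
relative <- to_relative_interest(toy_counts)
print(relative)
```

Because each search scope is scaled to its own peak, such values are better suited to comparing the shape of interest over time than absolute search volumes.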

Data Gathering

Caching Tweets

The main data source is the set of Tweets gathered throughout the tournament until the end of the quarter-finals on 2022-12-10. We used the function “search_tweets” from rtweet to cache Tweets before and after each quarter-final match using the search query “WorldCup OR worldcup”, and saved the results to local .csv files with the search time in the file name.

Running the following chunk caches up to 1,000 recent Tweets matching the search query “WorldCup OR worldcup” for each search scope (the U.S. and the world) and writes each result to a local .csv file in the folder named “cache”. The file name includes the search time for convenience in later analysis.

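library(rtweet)
library(dplyr)

# folder for the cached files; create it if it does not exist yet
cache_path <- paste(getwd(), "/cache/", sep = "")
dir.create(cache_path, showWarnings = FALSE)

auth_setup_default()
keywords_tweets <- "WorldCup OR worldcup"
usa <- lookup_coords("usa")

# Tweets restricted to the U.S.
current_time <- Sys.time()
us_tweets <- search_tweets(keywords_tweets, geocode = usa, lang = "en", n = 1000)
us_tweets$searched_at <- current_time
us_tweets <- us_tweets %>%
  select(full_text, searched_at)

# format the time stamp so that the file name contains no spaces or colons
filename_US <- paste("US-", format(current_time, "%Y-%m-%d_%H-%M-%S"), ".csv", sep = "")
path_US <- paste(cache_path, filename_US, sep = "")
# note: write.table() writes space-separated values by default
write.table(us_tweets, path_US, row.names = TRUE)

# Tweets from the whole world
current_time <- Sys.time()
world_tweets <- search_tweets(keywords_tweets, lang = "en", n = 1000)
world_tweets$searched_at <- current_time
world_tweets <- world_tweets %>%
  select(full_text, searched_at)

filename_world <- paste("World-", format(current_time, "%Y-%m-%d_%H-%M-%S"), ".csv", sep = "")
path_world <- paste(cache_path, filename_world, sep = "")
write.table(world_tweets, path_world, row.names = TRUE)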
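Since the search time is stored in the file name, it can be recovered later without opening the file. The sketch below shows one possible way to build and parse such names. The helper names `make_cache_name` and `parse_cache_time`, the `PREFIX-YYYY-mm-dd_HH-MM-SS.csv` pattern, and the example time are assumptions for illustration.

```r
# Build a cache file name that embeds the search time (hypothetical pattern).
make_cache_name <- function(prefix, time) {
  paste0(prefix, "-", format(time, "%Y-%m-%d_%H-%M-%S", tz = "UTC"), ".csv")
}

# Recover the search time from a file name that follows the same pattern.
parse_cache_time <- function(file_name) {
  stamp <- sub("^[^-]+-(.*)\\.csv$", "\\1", file_name)
  as.POSIXct(stamp, format = "%Y-%m-%d_%H-%M-%S", tz = "UTC")
}

searched_at <- as.POSIXct("2022-12-09 18:05:00", tz = "UTC")  # made-up search time
file_name <- make_cache_name("US", searched_at)
print(file_name)
print(parse_cache_time(file_name))
```

Fixing the time zone in both directions keeps the round trip exact. A name without spaces or colons is also valid on every common file system.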

Combine the cached files

We read all the previously cached files into one single data frame for further analysis.

# read cached files
# base_path (the cache folder) and the file lists us_files and world_files
# are assumed to be defined earlier in the session
read_cache <- function(file_path){
  # read every file, then bind the pieces together in one step
  tweets_list <- lapply(file_path, function(f) {
    read.csv(paste(base_path, f, sep = ""))
  })
  tweets_total <- do.call(rbind, tweets_list)
  colnames(tweets_total) <- c("value")
  return(tweets_total)
}

us_tweets_total <- read_cache(us_files)
# number of cached rows; length() on a data frame would return the number of columns
n_us <- nrow(us_tweets_total)

world_tweets_total <- read_cache(world_files)
n_world <- nrow(world_tweets_total)
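The file lists passed to “read_cache” are not defined in the chunks shown here. One way they might be built, assuming the cached files carry the “US-” and “World-” prefixes used when caching, is to filter the folder contents by file-name pattern. In the sketch below a temporary folder with empty placeholder files stands in for the real cache, and all names ending in `_demo` are hypothetical.

```r
# Create a throwaway folder with placeholder files (not real cached Tweets).
demo_cache <- file.path(tempdir(), "cache_demo")
dir.create(demo_cache, showWarnings = FALSE)
file.create(file.path(demo_cache, c("US-a.csv", "US-b.csv", "World-a.csv")))

# Split the folder contents into U.S. and worldwide lists by prefix.
us_files_demo <- list.files(demo_cache, pattern = "^US-.*\\.csv$")
world_files_demo <- list.files(demo_cache, pattern = "^World-.*\\.csv$")

print(us_files_demo)
print(world_files_demo)
```

`list.files()` returns bare file names by default, which matches a reader that pastes the folder path in front of each name.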

Results

Text Mining of Tweets

Sentiment Analysis

For sentiment analysis, we generated a corpus from the Tweets collected earlier. We cleaned the corpus by removing numbers, punctuation, extra white space, capital letters, stop words, etc. For the stop words, we included the words appearing in the search query, such as “worldcup”, “world”, and “cup”. In addition to the basic English stop words given by the function “stopwords("english")”, we also added other words that relate to the World Cup itself rather than to sentiment. Finally, we passed the preprocessed corpus to the function “TermDocumentMatrix” to create a term-document matrix. For repeated use, we wrote a function named “preprocess_into_tdm” that generates the term-document matrix from the Tweet text.

library(tm)
library(dplyr)

# function to preprocess the Tweets and transform them into a term document matrix
# `stopwords` is assumed to be the custom character vector described above
# (English stop words plus World Cup terms), defined before this function is called
preprocess_into_tdm <- function(Tweets){
  tweets.corpus <- Corpus(VectorSource(Tweets)) %>%
    tm_map(removeNumbers) %>% # removes numbers from text
    tm_map(removePunctuation) %>% # removes punctuation from text
    tm_map(stripWhitespace) %>% # collapses repeated whitespace
    tm_map(content_transformer(tolower)) %>% # convert text to lowercase
    tm_map(removeWords, stopwords) %>% # remove stopwords
    tm_map(removeWords, stopwords) # second pass for stopwords missed by the first
  tdm <- TermDocumentMatrix(tweets.corpus) %>% # create a term document matrix
    as.matrix()
  return(tdm)
}
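To make the shape of the result concrete, the following base R sketch mimics the same steps (lowercasing, dropping non-letters and stop words, counting terms per Tweet) on two made-up Tweets. It is a simplified stand-in for illustration, not the tm implementation. The inputs `toy_tweets` and `toy_stopwords` and the helper `tokenize` are hypothetical.

```r
# Made-up inputs for illustration only.
toy_tweets <- c("Croatia WON on penalties!!",
                "Brazil lost, Croatia won the World Cup match")
toy_stopwords <- c("on", "the", "world", "cup")

tokenize <- function(text, stop_words) {
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]  # lowercase, split on non-letters
  tokens <- tokens[nzchar(tokens)]                    # drop empty strings
  tokens[!tokens %in% stop_words]                     # drop stop words
}

tokens <- lapply(toy_tweets, tokenize, stop_words = toy_stopwords)
terms <- sort(unique(unlist(tokens)))

# Rows are terms, columns are Tweets, cells are term counts.
toy_tdm <- vapply(tokens,
                  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
                  integer(length(terms)))
rownames(toy_tdm) <- terms
print(toy_tdm)
```

Summing a row of such a matrix gives the total frequency of a term across all Tweets.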

With the term-document matrix at hand, we then conducted the sentiment analysis. We used the function “get_sentiments("nrc")” from the package “tidytext” to obtain a dictionary of 13,872 entries, each pairing a word with a sentiment. We extracted the unique words from the term-document matrix, ordered by how often they occur, into a tibble. Through an inner join between this word table and the NRC sentiment table, the self-defined function “sentiment_analysis(tdm)” returns a table of emotions and their percentages.

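library(dplyr)
library(tidytext)

# function to conduct sentiment analysis
sentiment_analysis <- function(tdm) {
  # unique terms, ordered by total frequency across all Tweets
  words <- unique(data.frame(word = names(sort(rowSums(tdm), decreasing = TRUE))))
  words <- as_tibble(words)
  # keep the terms found in the NRC lexicon and count the matches per sentiment;
  # each distinct term counts once, regardless of its frequency
  senti <- inner_join(words, get_sentiments("nrc"), by = "word") %>%
    count(sentiment)
  senti$percent <- (senti$n / sum(senti$n)) * 100
  return(senti)
}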
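The join-and-count logic can be traced by hand on a small example. The base R sketch below uses a toy lexicon in the same two-column shape (word, sentiment) as the NRC table. The word-sentiment pairs are made up for illustration and are not NRC entries.

```r
# Toy lexicon: a word may carry more than one sentiment.
toy_lexicon <- data.frame(
  word = c("won", "won", "lost", "penalties", "champion"),
  sentiment = c("positive", "joy", "negative", "fear", "positive"),
  stringsAsFactors = FALSE
)
# Unique words taken from a (hypothetical) term-document matrix.
toy_words <- data.frame(word = c("croatia", "won", "lost", "penalties"),
                        stringsAsFactors = FALSE)

# Inner join: only words that appear in the lexicon survive.
matched <- merge(toy_words, toy_lexicon, by = "word")

# Count the matched rows per sentiment and convert to percentages.
senti <- as.data.frame(table(sentiment = matched$sentiment),
                       responseName = "n", stringsAsFactors = FALSE)
senti$percent <- 100 * senti$n / sum(senti$n)
print(senti)
```

In this example “croatia” drops out because it is not in the lexicon, while “won” contributes to two sentiments. Percentages are therefore shares of matched word-sentiment pairs, not shares of Tweets.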

Comparing the results of the two sentiment analyses (using Tweets from the U.S. and from the world during the quarter-finals), we can see that the overall distributions of sentiments are similar. For instance, “positive” ranks at the top, followed by “negative” and “trust”, while “sadness”, “surprise”, and “disgust” are the three least frequent emotions. However, several differences do exist. The percentage of “joy” in the U.S. Tweets was higher than in the worldwide Tweets, while the percentage of “anticipation” in the U.S. was slightly lower than the worldwide result. The results demonstrate that the sentiment distributions of the two sets of Tweets differ from each other. Moreover, the analysis can be repeated on an ongoing basis as the tournament progresses.
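A comparison of this kind can be made explicit by lining up the two sentiment tables and taking the gap in percentage points. The sketch below assumes each table has a `sentiment` column and a count column `n`. The counts are made up and are not results from this study.

```r
# Made-up counts for two samples of very different size.
us_toy <- data.frame(sentiment = c("joy", "anticipation", "positive"),
                     n = c(30, 20, 50), stringsAsFactors = FALSE)
world_toy <- data.frame(sentiment = c("joy", "anticipation", "positive"),
                        n = c(200, 300, 500), stringsAsFactors = FALSE)

# Convert counts to percentages so that unequal sample sizes are comparable.
to_percent <- function(df) transform(df, percent = 100 * n / sum(n))

comparison <- merge(to_percent(us_toy), to_percent(world_toy),
                    by = "sentiment", suffixes = c("_us", "_world"))
comparison$gap <- comparison$percent_us - comparison$percent_world
print(comparison[, c("sentiment", "percent_us", "percent_world", "gap")])
```

A positive gap means the sentiment takes a larger share in the first sample. With small samples such gaps can easily arise by chance, so they are best read as descriptive.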

Conclusion and Discussion

This study uses the Twitter API and Google Trends to extract trending words and topics related to the ongoing 2022 Qatar World Cup and to conduct sentiment analysis. Based on the graphs above, we conclude that there is a discrepancy between U.S. Twitter trends and worldwide trends, which is supported from multiple perspectives.

Comparing the trending hashtags during the quarter-finals, we can see that the worldwide trending hashtags lag behind the match schedule, while those in the U.S. reflect real-time updates. As for the sentiment analysis, despite the similar overall pattern, some differences are worth noting. The percentage of “joy” in the U.S. Tweets was higher than in the worldwide Tweets, but the sentiment “anticipation” was slightly less frequent than in the worldwide result. The word cloud is another straightforward display of the most popular words at chosen time points. For example, after Croatia knocked Brazil out of the World Cup, the word cloud shows that the most frequent words in Tweets posted by users in the United States were “brazil” and the emoji of Croatia’s national flag. This indicates that these users followed up-to-date information and posted Tweets in real time. By contrast, when analyzing the worldwide trend at the same time, after Brazil was beaten by Croatia, we were surprised to find that the keywords related to the World Cup did not keep up with the match in progress. Instead, the most frequent words were “win”, “earn”, “argentina”, and “netherlands”, which relate more to the earlier matches. We therefore conclude that Tweets in the United States followed up-to-date information about World Cup matches, while worldwide Tweets seemed to lag behind and were not about the latest matches.
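A word cloud is essentially a display of term frequencies, so the words it emphasizes can also be listed directly from a term-document matrix. A minimal sketch follows, assuming a matrix with terms as rows. The helper name `top_terms` and the toy matrix are made up for illustration.

```r
# Return the n most frequent terms of a term-document matrix (rows = terms).
top_terms <- function(tdm, n = 3) {
  freq <- sort(rowSums(tdm), decreasing = TRUE)
  head(data.frame(word = names(freq), freq = as.vector(freq),
                  stringsAsFactors = FALSE), n)
}

# Made-up matrix: three terms counted across two Tweets.
toy_tdm <- matrix(c(2, 0, 1,
                    1, 1, 3),
                  nrow = 3,
                  dimnames = list(c("brazil", "croatia", "match"), c("t1", "t2")))
top <- top_terms(toy_tdm, n = 2)
print(top)
```

Running the same helper on matrices built from Tweets cached at different time points would give a tabular view of how the dominant words shift after each match.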

The limitations of this study mainly come from the data sources. First, the sample sizes for the U.S. and the world are largely unequal. Even though we took the difference into account and used percentages rather than absolute values for comparisons, it remains a significant source of bias. Second, the cached Tweets were gathered only during the quarter-finals, whereas the search volumes given by Google Trends indicate that the popularity of the World Cup peaked around the end of November. If more data were available for the analysis, the results could offer more insights. Third, the study does not explain the differences in sentiment or trending hashtags between the two sets of Tweets. Further studies could examine the popular words related to each sentiment. Moreover, the analysis could be generalized to compare sentiment and trends among different states in the U.S. or different countries around the world.

References

Rodrigues, Anisha P, Roshan Fernandes, Adarsh Bhandary, Asha C Shenoy, Ashwanth Shetty, and M Anisha. 2021. “Real-Time Twitter Trend Analysis Using Big Data Analytics and Machine Learning Techniques.” Wireless Communications and Mobile Computing 2021.